Statistical Mechanics of Semi-Supervised Clustering in Sparse Graphs
We theoretically study semi-supervised clustering in sparse graphs in the presence of pair-wise
constraints on the cluster assignments of nodes. We focus on bi-cluster graphs, and study the impact 
of semi-supervision for varying constraint density and overlap between the clusters. Recent results 
for unsupervised clustering in sparse graphs indicate that there is a critical ratio of within-cluster and 
between-cluster connectivities below which clusters cannot be recovered with better than random 
accuracy. The goal of this paper is to examine the impact of pair-wise constraints on the clustering 
accuracy. Our results suggest that when the density of the constraints is sufficiently small, their
only impact is to shift the detection threshold while preserving the criticality. Conversely, if the 
density of (hard) constraints is above the percolation threshold, the criticality is suppressed and the 
detection threshold disappears. 

PACS numbers: 89.75.Hc,89.65.-s,64.60.Cn 



I. INTRODUCTION 



Most real-world networks have a modular structure, i.e., they are composed of clusters of well-connected nodes, with a
relatively lower density of links across different clusters [?]. Those modules might represent groups of individuals
with shared interests in online social networks, functionally coherent units in protein-protein interaction networks,
or topic-specific research communities in co-authorship networks. Modularity is important for understanding various
structural and dynamical properties of networks [?]. Consequently, the problem of detecting modules, or clusters, has
recently attracted considerable interest in both the statistical physics and computer science communities (for a
recent comprehensive review of existing approaches see [10]).



Along with algorithmic development, recent research has focused on characterizing the statistical significance of
clusters detected by different methods [3, ?, ?, 23]. Another fundamental issue is the feasibility of detecting clusters,
assuming they are present in the network. To be specific, consider the so-called planted bi-partition model [5], which is
a special case of a more general family of generative models known as stochastic block-models [16, 27, 33]. In this model
the nodes are partitioned into two equal-sized groups, and each link within a group and between the groups is present
with probability $p$ and $r$, respectively. An important question is how well one can recover the latent cluster structure
in the limit of large network sizes. It is known that in dense networks, where the average connectivity scales linearly
with the number of nodes $N$ (e.g., $p$ and $r$ are constants), the clusters in the planted partition model can be recovered
with asymptotically perfect accuracy for any $p - r > N^{-1/2+\epsilon}$ [5]. Recently, a more general result obtained for a
larger class of stochastic block-models states that certain statistical inference methods are asymptotically consistent
provided that the average connectivity grows faster than $\log N$ [5].
The situation is significantly different for sparse graphs, where the average connectivity remains finite in the
asymptotic limit $N \to \infty$. Indeed, it was recently shown [32] that planted partition models (of arbitrary topology)
with finite connectivities are characterized by a phase transition from detectable to undetectable regimes as one
increases the overlap between the clusters, with the transition point depending on the actual degree distribution of
the partitions. In particular, let $p = \alpha/N$, $r = \gamma/N$, where $\alpha$ and $\gamma$ are the finite average connectivities within and
across the clusters, and let $p_{in} = \alpha/(\alpha+\gamma)$ be the fraction of links that fall within the clusters, so that $p_{in} = 1$ and
$p_{in} = 1/2$ correspond to perfectly separated and perfectly mixed clusters, respectively. Then there is a critical value
$p_{in}^c = 1/2 + \Delta$, $\Delta > 0$, such that for $p_{in} < p_{in}^c$ the clusters cannot be recovered with better than random accuracy in the
asymptotic limit. When the planted clusters have Erdős-Rényi topology, one can show that $\Delta \propto 1/\sqrt{\alpha+\gamma}$ for large
$(\alpha+\gamma)$ [?].
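The sparse planted bi-partition model described above can be sampled directly. The following is a minimal sketch, not the authors' code: the function names, the node ordering (first half in one cluster, second half in the other) and the parameter values are illustrative assumptions. It draws each pair with probability $\alpha/N$ or $\gamma/N$ and compares the empirical fraction of within-cluster links with $p_{in} = \alpha/(\alpha+\gamma)$.

```python
import random

def sample_planted_bipartition(n, alpha, gamma, seed=0):
    """Sample a sparse planted bi-partition graph on 2*n nodes.

    Nodes 0..n-1 are placed in cluster +1 and nodes n..2n-1 in cluster -1
    (an arbitrary labelling chosen for this sketch). Within-cluster pairs are
    linked with probability alpha / n, cross-cluster pairs with gamma / n.
    """
    rng = random.Random(seed)
    p, r = alpha / n, gamma / n
    labels = [1] * n + [-1] * n
    edges = []
    for i in range(2 * n):
        for j in range(i + 1, 2 * n):
            prob = p if labels[i] == labels[j] else r
            if rng.random() < prob:
                edges.append((i, j))
    return labels, edges

def empirical_p_in(labels, edges):
    """Fraction of observed links that fall within a cluster."""
    within = sum(1 for i, j in edges if labels[i] == labels[j])
    return within / len(edges)

# Illustrative parameters: alpha + gamma = 4 with p_in = 0.75.
labels, edges = sample_planted_bipartition(n=300, alpha=3.0, gamma=1.0)
p_in_hat = empirical_p_in(labels, edges)
```

The double loop is quadratic in the number of nodes, which is adequate for a small illustration; for large sparse graphs one would normally sample the number of edges first and then place them.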

From the perspective of statistical inference, this type of phase transition between detectable and undetectable
regimes is undesirable, as it signals inference instabilities: large fluctuations in accuracy in response to relatively
small shifts in the parameters. In [1] it was shown that this instability can be avoided if one uses prior knowledge
about the underlying group structure. Namely, it was demonstrated that knowing the correct cluster assignments for an
arbitrarily small but finite fraction of nodes destroys the criticality and moves the detection threshold to its intuitive
(dense-network limit) value $p_{in} = 1/2$, or $\alpha = \gamma$. This can be viewed as a semi-supervised version of the problem, as
opposed to an unsupervised version where the only available information is the observed graph structure.

In practice, semi-supervised graph clustering methods can utilize two types of background knowledge: cluster
assignments for a subset of nodes [?, 34], or pair-wise constraints in the form of must-links (cannot-links), which imply
that a pair of nodes must be (cannot be) assigned to the same group [3, 19]. In fact, the latter type of constraint is
more realistic in scenarios where it is easier to assess the similarity of two objects than to label them individually.
The goal of this paper is to examine the impact of the latter form of semi-supervision on clustering accuracy. Specifically,
we focus on a random network composed of two equal-sized clusters, where the clustering objective can be mapped to
an appropriately defined Ising model on the planted partition graph.

The rest of the paper is organized as follows. In the next section we reduce the block-structure estimation problem
to an appropriately defined Ising model, and describe the zero-temperature cavity method used to study the properties
of the model. In Section III we analytically study a specific case of soft constraints. In Section IV we consider the
case of hard constraints, and study it using population dynamics methods. We finish the paper by discussing our
main results in Section V.



II. MODEL 



Consider a graph with two clusters containing $N$ nodes each. Each pair of nodes within the same cluster is linked
with probability $p = \alpha/N$, where $\alpha$ is the average within-cluster connectivity. Also, each pair of nodes in different
clusters is linked with probability $r = \gamma/N$, where $\gamma$ is the inter-cluster connectivity. Let $s_i = \pm 1$ denote the cluster
assignment of node $i$, and let $\mathbf{s} = (s_1, \ldots, s_{2N})$ denote a cluster assignment for all the nodes in the network. Further,
let $A$ be the observed adjacency matrix of the interaction graph of $2N$ nodes, so that $A_{ij} = 1$ if we have observed a link
between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. Within the above model, the conditional distribution of observations
for a given configuration $\mathbf{s}$ reads

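$$ p(A|\mathbf{s}) = p^{c_+}[1-p]^{c_-}\, r^{d_+}[1-r]^{d_-} \qquad (1) $$

Here $c_+$, $d_+$ ($c_-$, $d_-$) are the total numbers of observed (missing) links within and across the groups,

$$ c_+ = \sum_{i<j} A_{ij}\,\delta_{s_i,s_j}, \qquad c_- = N(N-1) - c_+ \qquad (2) $$

$$ d_+ = \sum_{i<j} A_{ij}\,(1 - \delta_{s_i,s_j}), \qquad d_- = N^2 - d_+ \qquad (3) $$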



where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise. Let us define $J_{NL} = \ln[(1-r)/(1-p)]$, $J_L = \ln[p/r] + J_{NL}$. Then the
log of the joint distribution over both observed and hidden variables can be written as follows [3, 13]:

$$ H(\mathbf{s}, A) = -\ln[p(A|\mathbf{s})\,p(\mathbf{s})] = -\sum_{i<j} J_L A_{ij}\,\delta_{s_i,s_j} + \sum_{i<j} J_{NL}\,\delta_{s_i,s_j} + H_\pi(\mathbf{s}) \qquad (4) $$



Here $J_L$ and $J_{NL}$ stand for the contributions from observed links and non-links, respectively, while the last term
$H_\pi(\mathbf{s})$ encodes prior information about the latent structure one might have. In the scenario considered below, we
assume that the cluster sizes are known a priori. This constraint can be enforced by an appropriate choice of $H_\pi(\mathbf{s})$, so
that any clustering arrangement that violates the size constraint is disallowed. Then it is easy to check that the
second term amounts to a constant that can be ignored. Furthermore, since below we are interested in the minimum
of $H(\mathbf{s}, A)$, we can set $J_L = 1$ without loss of generality.
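The reduction of the block-model likelihood to the link and non-link sums of Eq. (4) can be checked numerically. The sketch below is illustrative only: the helper names and the small test values of $p$, $r$ and the graph size are assumptions, not values from the paper. It evaluates $-\ln p(A|\mathbf{s})$ pair by pair and compares it with $-J_L \sum A_{ij}\delta_{s_i,s_j} + J_{NL}\sum \delta_{s_i,s_j}$ for several random assignments; the two should differ only by a term that does not depend on $\mathbf{s}$.

```python
import itertools
import math
import random

def neg_log_likelihood(A, s, p, r):
    """-ln p(A|s) for the block model of Eq. (1), summed over pairs i < j."""
    total = 0.0
    for i, j in itertools.combinations(range(len(s)), 2):
        prob = p if s[i] == s[j] else r
        total -= math.log(prob if A[i][j] else 1.0 - prob)
    return total

def link_energy(A, s, p, r):
    """The two s-dependent sums of Eq. (4), without the prior term."""
    j_nl = math.log((1.0 - r) / (1.0 - p))
    j_l = math.log(p / r) + j_nl
    energy = 0.0
    for i, j in itertools.combinations(range(len(s)), 2):
        if s[i] == s[j]:
            energy += j_nl - j_l * A[i][j]
    return energy

rng = random.Random(1)
n_nodes, p, r = 12, 0.4, 0.1  # illustrative values, not taken from the paper
A = [[0] * n_nodes for _ in range(n_nodes)]
for i, j in itertools.combinations(range(n_nodes), 2):
    A[i][j] = A[j][i] = int(rng.random() < 0.3)

# The offset between the two expressions should be the same for every s.
offsets = []
for _ in range(5):
    s = [rng.choice((-1, 1)) for _ in range(n_nodes)]
    offsets.append(neg_log_likelihood(A, s, p, r) - link_energy(A, s, p, r))
spread = max(offsets) - min(offsets)
```

Here `spread` measures how much the offset varies across assignments and should vanish up to floating-point rounding.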

We find it convenient to make the bi-cluster nature of the network explicit by introducing separate variables $s_i = \pm 1$
and $\bar{s}_i = \pm 1$ ($i = 1, \ldots, N$) for the two groups. Then Eq. (4) is reduced to the following Ising Hamiltonian (aside from an
inessential scaling factor):

$$ H = -\sum_{i<j}^{N} J_{ij}\, s_i s_j - \sum_{i<j}^{N} \bar{J}_{ij}\, \bar{s}_i \bar{s}_j - \sum_{i,j}^{N} K_{ij}\, s_i \bar{s}_j + H_\pi(\mathbf{s}) \qquad (5) $$

Here $J_{ij}$ and $\bar{J}_{ij}$ are the elements of the two diagonal blocks of the matrix $A$ describing the connectivity within each
cluster, whereas the $K_{ij}$ are the elements of the (upper) off-diagonal block of $A$ that link nodes across the clusters. In
the unsupervised block-model, they are random Bernoulli trials with parameters $p$ and $r$.

To account for background information in the form of pairwise constraints, we use the following form for the prior
part of the Hamiltonian:

$$ H_\pi(\mathbf{s}) = -w_m \sum_{i<j}^{N} \big[\theta_{ij}\, s_i s_j + \bar{\theta}_{ij}\, \bar{s}_i \bar{s}_j\big] + w_c \sum_{i,j}^{N} \phi_{ij}\, s_i \bar{s}_j \qquad (6) $$

where $\theta_{ij} = 1$ ($\bar{\theta}_{ij} = 1$) if the corresponding pair of nodes is connected via a must-link constraint within the first
(second) cluster, and $\theta_{ij} = 0$ ($\bar{\theta}_{ij} = 0$) otherwise. Similarly, $\phi_{ij} = 1$ if there is a cannot-link between the corresponding
nodes in the respective clusters, and $\phi_{ij} = 0$ if there is no such link. Here $w_m$ and $w_c$ are the costs of violating a must-link
and a cannot-link constraint, respectively. For the sake of simplicity, below we choose $w_m = w_c = w$, where $w$ is
a positive integer.

Below we will assume that the constraints are introduced randomly and independently for each pair of nodes.
Namely, $\theta_{ij}$, $\bar{\theta}_{ij}$ and $\phi_{ij}$ are Bernoulli trials with parameters $f_m$ and $f_c$, respectively. Then the prior part of the
Hamiltonian can be absorbed into Eq. (5) by the following choice for the distribution of the couplings in Eq. (5):

$$ p(J_{ij}) = [1-p]\,\delta(J_{ij}) + p[1-f_m]\,\delta(J_{ij} - 1) + p f_m\,\delta(J_{ij} - w) \qquad (7) $$

$$ p(K_{ij}) = [1-r]\,\delta(K_{ij}) + r[1-f_c]\,\delta(K_{ij} - 1) + r f_c\,\delta(K_{ij} + w) \qquad (8) $$

We are interested in the properties of the above Hamiltonian (5) in the limit of large $N$. Below we study it within
the Bethe-Peierls approximation. Let $P(h)$ ($\bar{P}(h)$) denote the probability of an internal (cavity) field acting on an $s$
($\bar{s}$) spin. Then we have, according to the zero-temperature cavity method [21, ?]:

$$ P(h) = \int \prod_k dJ_{0k}\, p(J_{0k}) \int \prod_k dK_{0k}\, p(K_{0k}) \int \prod_k P(h_k)\, dh_k \prod_k \bar{P}(g_k)\, dg_k \;\, \delta\Big[ h - \sum_k \phi[h_k, J_{0k}] - \sum_k \phi[g_k, K_{0k}] \Big] \qquad (9) $$

where $\phi[a, b] = \mathrm{sign}(a)\,\mathrm{sign}(b)\,\min[|a|, |b|]$, and where $g_k$ (resp. $h_k$) are the fields acting on the $s$-spin from $\bar{s}$-spins
(respectively from other $s$-spins).

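Using the integral representation of the delta function, performing the integration over the coupling constants, and
employing the symmetry $\bar{P}(g) = P(-g)$, we obtain in the limit $N \to \infty$

$$ P(h) = \int \frac{dz}{2\pi}\, e^{izh - \alpha - \gamma} \exp\Big\{ \int dg\, P(g) \Big[ \alpha(1-f_m)\, e^{-iz\,\mathrm{sign}(g)\min[|g|,1]} + \gamma(1-f_c)\, e^{iz\,\mathrm{sign}(g)\min[|g|,1]} + (\alpha f_m + \gamma f_c)\, e^{-iz\,\mathrm{sign}(g)\min[|g|,w]} \Big] \Big\} \qquad (10) $$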



Once $P(h)$ is found we obtain the first two moments of $s_i$ as

$$ m = \int dh\, P(h)\, \mathrm{sign}(h), \qquad q = \int dh\, P(h)\, \mathrm{sign}^2(h) \qquad (11) $$

Here $m$ is the magnetization averaged over the graph structure, including the constraints (i.e., over $J_{ij}$,
$\bar{J}_{ij}$ and $K_{ij}$), and over the Gibbs distribution. In the zero-temperature case considered here, the latter means averaging
over all configurations of $s_i$ and $\bar{s}_i$ that, in the thermodynamic limit, have the same (minimal) value of the Hamiltonian
$H$. The quantity $q$ is called the Edwards-Anderson (EA) order parameter; $q$ differs from 1 due to a possible contribution $\propto \delta(h)$ in
$P(h)$. Note that the accuracy of the clustering (i.e., the probability that a node has been assigned to the correct cluster)
is simply $(1 + |m|)/2$. Thus, $|m| = 1$ corresponds to perfect clustering, whereas $m = 0$ means that the discovered clusters have
only random overlap with the true cluster assignments.

In the following, we assume that the constraints are generated with uniform probability $f_m = f_c = \rho/(\alpha+\gamma)$, where
$\rho$ is the average number of constraints per node.



III. SOFT CONSTRAINTS 



We first consider the case when violating a constraint carries a finite cost. The most trivial such case is when $w = 1$.
In this case, the must-link constraints do not yield any additional information. The cannot-link constraints, however,
help clustering by "flipping" the sign of the corresponding edge, thus favoring anti-ferromagnetic interactions between
the nodes across different clusters. In fact, it can be shown from Eq. (10) that the only impact of the constraints with
$w = 1$ is to renormalize the within- and across-cluster connectivities, $(\alpha, \gamma) \to (\alpha + f_c\gamma,\ \gamma - f_c\gamma)$ with $f_c = \rho/(\alpha+\gamma)$. Recall that the mixing
parameter is defined as $p_{in} = \alpha/(\alpha+\gamma)$. Thus, this situation is identical to the unsupervised clustering scenario [1],
with the renormalized mixing parameter $p_{in} \to p_{in} + \rho(1 - p_{in})/(\alpha+\gamma)$. The sole impact of the constraints is to shift the
detection threshold below which clusters cannot be detected with better than random accuracy. In particular, setting the
renormalized mixing parameter at $p_{in} = 1/2$ equal to the unsupervised threshold $p_{in}^c$ shows that the modified threshold
coincides with its dense-network limit $1/2$ for $\rho = (\alpha+\gamma)(2p_{in}^c - 1)$.
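The effect of $w = 1$ constraints on the mixing parameter can be made concrete with a few lines of arithmetic. The sketch below assumes, as in the surrounding text, that a fraction $f_c = \rho/(\alpha+\gamma)$ of cross-cluster links is flipped and therefore acts like within-cluster links; the function names are hypothetical. The value 0.786 is the unsupervised threshold quoted later in the paper for $\alpha+\gamma = 4$ and is used here only as an illustrative input.

```python
def renormalized_p_in(alpha, gamma, rho):
    """Mixing parameter after absorbing w = 1 cannot-links into the connectivities."""
    f_c = rho / (alpha + gamma)      # probability that a link carries a constraint
    alpha_eff = alpha + f_c * gamma  # flipped cross links act like within-cluster links
    gamma_eff = gamma - f_c * gamma
    return alpha_eff / (alpha_eff + gamma_eff)

def rho_for_threshold_at_half(connectivity, p_in_c):
    """Constraint density at which the shifted threshold reaches p_in = 1/2."""
    return connectivity * (2.0 * p_in_c - 1.0)

# alpha = gamma = 2 (p_in = 1/2) with one constraint per node on average.
shifted = renormalized_p_in(alpha=2.0, gamma=2.0, rho=1.0)

# Density needed to make perfectly mixed clusters marginally detectable,
# assuming the unsupervised threshold 0.786 for alpha + gamma = 4.
rho_half = rho_for_threshold_at_half(connectivity=4.0, p_in_c=0.786)
```

With these inputs `shifted` equals $1/2 + \rho(1 - 1/2)/4 = 0.625$, in agreement with the closed-form shift of the mixing parameter.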
The role of semi-supervision is qualitatively similar for $w = 2$, which we consider next. Due to the network sparsity,
we search for the solution in the form

$$ P(h) = \sum_{n=-\infty}^{\infty} c_n\, \delta(h - n), \qquad \sum_{n=-\infty}^{\infty} c_n = 1 \qquad (12) $$

Let us define the following four order parameters:

$$ q = \sum_{k=1}^{\infty} [c_k + c_{-k}], \quad m = \sum_{k=1}^{\infty} [c_k - c_{-k}], \quad q_1 = c_1 + c_{-1}, \quad m_1 = c_1 - c_{-1} \qquad (13) $$

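Then Equation (10) can be represented as follows:

$$ P(h) = \int \frac{dz}{2\pi}\, e^{izh - (\alpha+\gamma)q} \prod_{n=1}^{2} \exp\Big[ \frac{x_n}{2} \big( e^{inz + y_n} + e^{-inz - y_n} \big) \Big] \qquad (14) $$

where

$$ x_1 = \sqrt{[(\alpha+\gamma-\rho)q + \rho q_1]^2 - [(\alpha+\gamma-\rho)\kappa m + \rho m_1]^2} \qquad (15) $$

$$ y_1 = -\mathrm{atanh}\, \frac{(\alpha+\gamma-\rho)\kappa m + \rho m_1}{(\alpha+\gamma-\rho)q + \rho q_1} \qquad (16) $$

$$ x_2 = \rho \sqrt{(q - q_1)^2 - (m - m_1)^2} \qquad (17) $$

$$ y_2 = -\mathrm{atanh}\, \frac{m - m_1}{q - q_1} \qquad (18) $$

$$ \kappa = \frac{\alpha - \gamma}{\alpha + \gamma} \equiv 2p_{in} - 1 \qquad (19) $$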

To proceed further, we express the integrands in the rhs of Eq. (14) through modified Bessel functions, and obtain for
the coefficients

$$ c_l = e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} I_{2n-l}(x_1)\, I_n(x_2)\, e^{-y_1(l - 2n) - n y_2} \qquad (20) $$



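Furthermore, combining Eq. (20) with Eq. (13) yields the following closed system for the order parameters of the
model:

$$ q = 2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} \sum_{l=1}^{\infty} I_{2n-l}(x_1)\, I_n(x_2) \cosh[2n y_1 - y_1 l - n y_2] \qquad (21) $$

$$ q_1 = 2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} I_{2n-1}(x_1)\, I_n(x_2) \cosh[2n y_1 - y_1 - n y_2] \qquad (22) $$

$$ m = 2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} \sum_{l=1}^{\infty} I_{2n-l}(x_1)\, I_n(x_2) \sinh[2n y_1 - y_1 l - n y_2] \qquad (23) $$

$$ m_1 = 2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} I_{2n-1}(x_1)\, I_n(x_2) \sinh[2n y_1 - y_1 - n y_2] \qquad (24) $$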
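The closed system for $(q, q_1, m, m_1)$ can be explored by naive fixed-point iteration. The sketch below is one possible reading of Eqs. (15)-(24), not the authors' numerical procedure: it evaluates the coefficients $c_l$ of Eq. (20) with truncated sums and recomputes the order parameters from Eq. (13). The truncation limits `lmax` and `nmax`, the power-series routine `bessel_i`, the starting point and the parameter values are all assumptions made for illustration, and no care is taken for extreme regimes where the ratios inside atanh approach 1.

```python
import math

def bessel_i(k, x, terms=80):
    """Modified Bessel function I_k(x) for integer k, from its power series."""
    k = abs(k)
    half = 0.5 * x
    term = half ** k / math.factorial(k)
    total = term
    for j in range(1, terms):
        term *= half * half / (j * (j + k))
        total += term
    return total

def field_weights(q, q1, m, m1, alpha, gamma, rho, lmax=30, nmax=20):
    """Coefficients c_l of Eq. (20) for l = -lmax..lmax, with truncated n-sum."""
    c = alpha + gamma
    kappa = (alpha - gamma) / c
    a1, b1 = (c - rho) * q + rho * q1, (c - rho) * kappa * m + rho * m1
    a2, b2 = rho * (q - q1), rho * (m - m1)

    def polar(a, b):
        # x = sqrt(a^2 - b^2) and y = -atanh(b / a), as in Eqs. (15)-(18)
        if a <= 0.0:
            return 0.0, 0.0
        ratio = max(-1.0 + 1e-12, min(1.0 - 1e-12, b / a))
        return math.sqrt(max(a * a - b * b, 0.0)), -math.atanh(ratio)

    x1, y1 = polar(a1, b1)
    x2, y2 = polar(a2, b2)
    i1 = [bessel_i(k, x1) for k in range(2 * nmax + lmax + 1)]
    i2 = [bessel_i(n, x2) for n in range(nmax + 1)]
    prefactor = math.exp(-c * q)
    weights = {}
    for l in range(-lmax, lmax + 1):
        total = 0.0
        for n in range(-nmax, nmax + 1):
            total += i1[abs(2 * n - l)] * i2[abs(n)] * math.exp((2 * n - l) * y1 - n * y2)
        weights[l] = prefactor * total
    return weights

def iterate_order_parameters(alpha, gamma, rho, start=(0.9, 0.3, 0.5, 0.1), sweeps=100):
    """Iterate Eq. (13) on the weights of Eq. (20) until (hopefully) stationary."""
    q, q1, m, m1 = start
    for _ in range(sweeps):
        w = field_weights(q, q1, m, m1, alpha, gamma, rho)
        lmax = max(w)
        q = sum(w[l] + w[-l] for l in range(1, lmax + 1))
        m = sum(w[l] - w[-l] for l in range(1, lmax + 1))
        q1, m1 = w[1] + w[-1], w[1] - w[-1]
    return q, q1, m, m1

# Well-separated clusters (p_in = 0.95) with rho = 0.5 constraints per node.
q, q1, m, m1 = iterate_order_parameters(alpha=3.8, gamma=0.2, rho=0.5)
accuracy = (1.0 + abs(m)) / 2.0
```

A useful sanity check of this reading is that the weights returned by `field_weights` sum to one for any admissible input, which follows from the generating function of the modified Bessel functions.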



We now analyze these equations in more detail. 

Again, the case $\rho = 0$ corresponds to the unsupervised scenario studied in [1], and the system of Eqs. (21)-(24) predicts
a second-order phase transition in the parameters $m$ and $m_1$ at a critical mixing parameter value $p_{in}^c$. Below the critical
point the magnetization $m$ is zero. Recall that the clustering accuracy (i.e., the fraction of correct cluster assignments)
is given by $(1 + |m|)/2$. Thus, for any $p_{in} < p_{in}^c$ the estimation cannot do any better than random guessing.

This transition persists also for $\rho > 0$: the presence of the constraints merely shifts the transition point to lower
values of $p_{in}$. This is shown in Figure 1(a), where we plot the clustering accuracy as a function of $p_{in}$ for $\alpha + \gamma = 4$.
At a certain value of $\rho$, the detection threshold becomes $p_{in}^c = 1/2$. If $\rho$ is increased even further, the model has a
non-zero magnetization even when $p_{in} = 1/2$.

Note that this shift suggests a highly non-linear effect of the added constraints, depending on the network parameters.
Indeed, when the connectivity is close to its critical value, the constraints can significantly improve the clustering
accuracy by moving the system away from the critical regime. And when the system is away from the critical region
to start with, then the addition of the constraints might yield no improvement at all compared to the unsupervised
scenario. To be more specific, consider Figure 1(a) and the impact of semi-supervision for two different
network parameters, $p_{in} = 0.7$ and $p_{in} = 0.65$. We observe that, compared to unsupervised clustering, adding $\rho = 1$
constraint per node has drastically different outcomes for those two parameter values. For $p_{in} = 0.7$, the addition of
the constraints changes the clustering accuracy from 50% (random guessing) to nearly 70%. However, adding the
same amount of constraints has no impact on accuracy for the network with $p_{in} = 0.65$.




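To find the position of the phase transition, we linearize Eqs. (23), (24) around $m = m_1 = 0$:

$$ m = -2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} \sum_{l=1}^{\infty} I_{2n-l}(\tilde{x}_1)\, I_n(\tilde{x}_2) \Big[ \frac{2n-l}{\tilde{x}_1} \big[(\alpha+\gamma-\rho)\kappa m + \rho m_1\big] - \frac{n}{\tilde{x}_2}\, \rho (m - m_1) \Big] \qquad (25) $$

$$ m_1 = -2e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} I_{2n-1}(\tilde{x}_1)\, I_n(\tilde{x}_2) \Big[ \frac{2n-1}{\tilde{x}_1} \big[(\alpha+\gamma-\rho)\kappa m + \rho m_1\big] - \frac{n}{\tilde{x}_2}\, \rho (m - m_1) \Big] \qquad (26) $$

Here

$$ \tilde{x}_1 = (\alpha+\gamma-\rho)q + \rho q_1, \qquad \tilde{x}_2 = \rho (q - q_1) \qquad (27) $$

and $q$, $q_1$ are the solutions of Eqs. (21), (22) with $m = m_1 = 0$.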

Further simplification of Eqs. (25), (26) is carried out by using the identity $2k I_k(z) = z[I_{k-1}(z) - I_{k+1}(z)]$. Let us
define

$$ \tilde{c}_k = e^{-(\alpha+\gamma)q} \sum_{n=-\infty}^{\infty} I_n(\tilde{x}_2)\, I_{2n-k}(\tilde{x}_1) \qquad (28) $$



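Then the phase transition corresponds to $\det A = 0$, where the elements of the $2 \times 2$ matrix $A$ are as follows:

$$ A_{11} = -1 + (\tilde{c}_0 + \tilde{c}_1)(\alpha+\gamma-\rho)\kappa + \rho(\tilde{c}_0 + 2\tilde{c}_1 + \tilde{c}_2) \qquad (29) $$

$$ A_{12} = -\rho(\tilde{c}_1 + \tilde{c}_2) \qquad (30) $$

$$ A_{21} = (\tilde{c}_0 - \tilde{c}_2)(\alpha+\gamma-\rho)\kappa + \rho(\tilde{c}_1 - \tilde{c}_3) \qquad (31) $$

$$ A_{22} = -1 + \rho(\tilde{c}_0 + \tilde{c}_3 - \tilde{c}_1 - \tilde{c}_2) \qquad (32) $$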
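As a worked illustration of how the Bessel-function identity produces the matrix elements, the following sketch derives the second row of $A$ from the linearized equation for $m_1$. It assumes the reading $\tilde{c}_k = e^{-(\alpha+\gamma)q} \sum_n I_n(\tilde{x}_2) I_{2n-k}(\tilde{x}_1)$ with $\tilde{x}_1 = (\alpha+\gamma-\rho)q + \rho q_1$ and $\tilde{x}_2 = \rho(q - q_1)$, and it is a reconstruction rather than the authors' own derivation.

```latex
% Uses 2k I_k(z) = z [I_{k-1}(z) - I_{k+1}(z)] and the symmetry
% \tilde c_{-k} = \tilde c_k (send n -> -n and use I_{-k} = I_k).
\begin{align*}
e^{-(\alpha+\gamma)q} \sum_n \frac{2n-1}{\tilde x_1}\, I_{2n-1}(\tilde x_1)\, I_n(\tilde x_2)
  &= \tfrac{1}{2}\, e^{-(\alpha+\gamma)q} \sum_n \big[ I_{2n-2}(\tilde x_1) - I_{2n}(\tilde x_1) \big] I_n(\tilde x_2)
   = \tfrac{1}{2} (\tilde c_2 - \tilde c_0), \\
e^{-(\alpha+\gamma)q} \sum_n \frac{n}{\tilde x_2}\, I_n(\tilde x_2)\, I_{2n-1}(\tilde x_1)
  &= \tfrac{1}{2}\, e^{-(\alpha+\gamma)q} \sum_n \big[ I_{n-1}(\tilde x_2) - I_{n+1}(\tilde x_2) \big] I_{2n-1}(\tilde x_1)
   = \tfrac{1}{2} (\tilde c_1 - \tilde c_3), \\
m_1 &= (\tilde c_0 - \tilde c_2) \big[ (\alpha+\gamma-\rho)\kappa\, m + \rho\, m_1 \big]
       + \rho\, (\tilde c_1 - \tilde c_3)(m - m_1).
\end{align*}
% Collecting the coefficients of m and m_1 gives
%   A_{21} = (\tilde c_0 - \tilde c_2)(\alpha+\gamma-\rho)\kappa + \rho(\tilde c_1 - \tilde c_3),
%   A_{22} = -1 + \rho(\tilde c_0 + \tilde c_3 - \tilde c_1 - \tilde c_2).
```

The first row follows in the same way from the equation for $m$, where the additional sum over $l \geq 1$ telescopes.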

Assume a fixed connectivity $\alpha + \gamma$. Then, for each value of $\rho$, there is a critical value of the mixing parameter below
which clusters are undetectable. Similarly, for each value of $p_{in}$, there is a critical $\rho_c = \rho_c(p_{in})$ so that for any $\rho < \rho_c$
clusters cannot be detected. In Figure 1(b) we plot the detection boundaries on the $(p_{in}, \rho)$ plane for different values of
$\alpha + \gamma$. For each $\alpha + \gamma$, the corresponding boundary separates two regimes, so that points above (below) the separator
correspond to detectable (undetectable) clusters.
IV. CLUSTERING WITH HARD CONSTRAINTS

We now consider the case of hard constraints by setting $w = \infty$.¹ In this case, the cavity equation involves all
the order parameters, which makes its analysis more complicated. Instead, we address this case by solving the cavity
equation using population dynamics.

We begin with a brief description of population dynamics, which is described in more detail in an identical setting
in [?]. The goal is to estimate the distribution of fields $P(h)$ that is the fixed point of Eq. (9). One starts with a large
population of $\mathcal{N}$ random fields $h_i$, then generates a new random cavity field,
$h_0 = \sum_k \phi[h_k, J_{0k}] - \sum_k \phi[g_k, K_{0k}]$,
where $J_{0k}$, $K_{0k}$ are drawn according to their fixed distributions [Eqs. (7), (8)] and the other fields, $h_k$, $g_k$, are drawn
randomly from the population. We then replace a random field in the population with the newly calculated cavity field.
The stationary distribution of this Markov chain corresponds to the fixed point of Eq. (9) for large populations. We can
then calculate the magnetization using Eq. (11), approximating $P(h) \approx \sum_i \delta_{h_i,h}/\mathcal{N}$. We used $\mathcal{N} = 100{,}000$ for the
results depicted in Figs. 2(a) and 2(b). We also compare our results to simulations using synthetic data. After generating
random graphs of size $N = 100{,}000$, we find the ground state of the Hamiltonian (5) using simulated annealing.
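The population dynamics procedure described above can be sketched in a few dozen lines. This is an illustrative implementation under stated assumptions, not the code used for the figures. In the sparse limit the numbers of within- and cross-cluster neighbours are taken to be Poisson with means $\alpha$ and $\gamma$, and each link carries a constraint with probability $f = \rho/(\alpha+\gamma)$. A single population is used, with the symmetry $\bar{P}(g) = P(-g)$ accounting for the minus sign in the update. The population is initialised in the ordered state, and the population size and number of sweeps are far smaller than the $\mathcal{N} = 100{,}000$ quoted in the text.

```python
import math
import random

def sign(x):
    return (x > 0) - (x < 0)

def phi(a, b):
    """Zero-temperature cavity bias: sign(a) sign(b) min(|a|, |b|)."""
    return sign(a) * sign(b) * min(abs(a), abs(b))

def poisson(rng, lam):
    """Poisson sample by multiplying uniforms (adequate for small lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def population_dynamics(alpha, gamma, rho, w, pop_size=2000, sweeps=30, seed=0):
    """Estimate (m, q) of Eq. (11) from a population of cavity fields."""
    rng = random.Random(seed)
    f = rho / (alpha + gamma)  # probability that a link carries a constraint
    pop = [1] * pop_size       # assumption: start from the ordered state
    for _ in range(sweeps * pop_size):
        h0 = 0
        for _within in range(poisson(rng, alpha)):
            coupling = w if rng.random() < f else 1   # must-link or plain link
            h0 += phi(rng.choice(pop), coupling)
        for _cross in range(poisson(rng, gamma)):
            coupling = -w if rng.random() < f else 1  # cannot-link or plain link
            h0 -= phi(rng.choice(pop), coupling)
        pop[rng.randrange(pop_size)] = h0
    m = sum(sign(h) for h in pop) / pop_size
    q = sum(sign(h) ** 2 for h in pop) / pop_size
    return m, q

# Well-separated clusters with soft constraints (w = 2); passing
# w=math.inf instead corresponds to hard constraints.
m_high, q_high = population_dynamics(alpha=3.8, gamma=0.2, rho=0.5, w=2)
```

Starting from the ordered state is a design choice made here so that a magnetized fixed point, when it exists, is reached quickly; the text itself only specifies a population of random fields.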

For a subset of nodes connected by labeled edges, we can determine the relative group membership for any pair
in the group due to the transitivity of the constraints. Therefore, finding node assignments that satisfy hard link
constraints on a graph amounts to a two-coloring problem and can be done efficiently. As we add random edges to
a graph, the size of the connected clusters is well known [?]. For $\rho < 1$, most clusters are disconnected and the size
of the largest cluster is $O(\log N)$. At $\rho = 1$, we reach the "percolation threshold", where the size of the largest
cluster goes like $O(N^{2/3})$. Once $\rho > 1$, $O(N)$ nodes belong to one giant connected component. We will investigate
the consequences of these different regimes below.
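The two-coloring step mentioned above can be illustrated with a breadth-first search over the constraint graph. The sketch below is a generic implementation, and the function name and input format (triples `(i, j, same)`) are assumptions made here. For every connected component of labeled edges it fixes the relative assignment of each node with respect to an arbitrary root, and it reports an error if the constraints contradict each other.

```python
from collections import deque

def two_color(num_nodes, constraints):
    """Propagate must-link / cannot-link constraints by breadth-first search.

    constraints: iterable of (i, j, same), with same=True for a must-link and
    same=False for a cannot-link. Returns (component, spin), where
    component[v] is the root of v's constraint component and spin[v] = +1 or
    -1 is v's assignment relative to that root.
    """
    neighbours = [[] for _ in range(num_nodes)]
    for i, j, same in constraints:
        relation = 1 if same else -1
        neighbours[i].append((j, relation))
        neighbours[j].append((i, relation))
    spin = [0] * num_nodes
    component = [-1] * num_nodes
    for root in range(num_nodes):
        if spin[root] != 0:
            continue
        spin[root], component[root] = 1, root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v, relation in neighbours[u]:
                expected = spin[u] * relation
                if spin[v] == 0:
                    spin[v], component[v] = expected, root
                    queue.append(v)
                elif spin[v] != expected:
                    raise ValueError("constraints are inconsistent")
    return component, spin

# Nodes 0, 1, 2 and nodes 3, 4 form two constraint components; node 5 is free.
component, spin = two_color(6, [(0, 1, True), (1, 2, False), (3, 4, False)])
```

The run time is linear in the number of nodes plus constraints. Merging each component into a single effective node is one way to read the "renormalized graph" picture used in the discussion of hard constraints.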



Looking at the results of Fig. 2(a), we see that for small amounts of supervision, $\rho < 1$, the impact of the constraints
is to shift the detection threshold to smaller values of $p_{in}$. Qualitatively, this is no different from the effect of adding
more unlabeled edges within clusters. This behavior is expected, since adding hard constraints is equivalent to studying
the same unsupervised clustering problem on a renormalized graph (e.g., merging two nodes that are connected via
constraints). This is in contrast to the results for prior information on nodes in [1], which showed that even small amounts
of node supervision shift the detection threshold to its lowest possible value $p_{in} = 1/2$.

As $\rho \to 1$, there is a qualitative change in our ability to detect clusters. A large number of nodes, $O(N^{2/3})$, are
connected by labeled edges. If we take the relative labeling of nodes in this largest group as the "correct" one, then
we have a situation similar to node supervision, which, as discussed, moves the detection threshold to $\alpha = \gamma$. While
this large labeled component suffices to create non-zero magnetization in finite graphs (as seen from the simulated
annealing results), as $N$ gets large, the effect of this component diminishes. For $\rho > 1$, we see that the fraction of
nodes contained in the largest labeled component suffices to produce non-zero magnetization even at the group-defining
threshold $\alpha = \gamma$.

We define the detection threshold $\rho_c$ as the minimum value of $\rho$ such that $\rho > \rho_c \Rightarrow m(\alpha, \gamma, \rho) > 0$. In Fig. 2(b) we
investigate the location of the detection threshold $\rho_c$ for a graph with fixed connectivity $\alpha + \gamma = 4$, as a function of
the cluster overlap, $p_{in}$. For $p_{in} \to 1/2$, where there is little cluster structure, we see that $\rho_c$ is just greater than 1. As
noted, this is the regime where a finite fraction of nodes is fixed by edge constraints. As $p_{in} \to 1$ and the cluster structure
becomes more well-defined, the necessary number of labeled edges per node falls below 1, where the size of clusters
connected by edge constraints is only logarithmic in the number of nodes. Finally, $\rho_c$ approaches 0 at $p_{in} \approx 0.786$. This
$p_{in}$ is the point above which the magnetization is non-zero even without any supervision.

V. DISCUSSION 

We have presented a statistical mechanics analysis of semi-supervised clustering in sparse graphs in the presence
of pair-wise constraints on node cluster assignments. Our results show that the addition of constraints does not provide
automatic improvement over the unsupervised case. This is in sharp contrast with the semi-supervised clustering
scenario considered in [1], where any generic fraction of labeled nodes improves clustering accuracy.

When the cost of violating constraints is finite, the only impact of adding pair-wise constraints is to lower the
detection boundary. Thus, whether adding constraints is beneficial or not depends on the network parameters (see
Figure 1(a)). For semi-supervised clustering with hard pair-wise constraints, the situation is similar if the number
of added constraints is sufficiently small. Namely, for a small density of constraints the subgraph induced by the must-
and cannot-links consists mostly of isolated small components, and the only impact of the added constraints is to
lower the detection boundary. The situation changes drastically when the constraint density reaches the percolation
threshold. Due to the transitivity of constraints, this induces a non-vanishing subset of nodes (transitive closure) that
belong to the same cluster, a scenario that is similar to the one studied in Ref. [1]. In this case, the detection boundary
disappears for any $\alpha$, $\gamma$.

¹ Note that fixing $w$ to some large but finite number of $O(N)$ will guarantee that all the constraints are satisfied.

FIG. 2: (a) Magnetization plotted against the mixing parameter $p_{in}$ for $\alpha + \gamma = 4$ and different $\rho$. Lines are generated from
population dynamics and points are generated from simulated annealing. From bottom to top we have $\rho = 0, 0.5, 1, 1.5$. (b)
Dependence of the critical mixing parameter $p_{in}^c$ on $\rho$ for $\alpha + \gamma = 4$. In contrast to the $w = 2$ case (dashed line), the detection
threshold for $w = \infty$ attains its minimum value $p_{in} = 1/2$ when $\rho$ is slightly greater than 1.

In the study presented here, we assume that the edges are labeled randomly. One can ask whether other, non-random
edge-selection strategies will lead to better results. Intuition tells us that the answer is affirmative. Indeed,
in the random case one needs to add $\rho = 1$ additional edges per node in order to have the benefit of transitivity. For
a given $\rho$, a much better strategy would be to choose $\rho N + 1$ random nodes (rather than edges), and connect them
into a chain using labeled edges. This would guarantee the existence of a finite fraction of labeled nodes for any $\rho$.

Finally, it is possible to envision a situation where one has access to two types of information - about cluster 
assignment of specific nodes [1], and pairwise constraints such as those studied in the present paper. Furthermore, this
information might be available at a cost that, generally speaking, will be different for either type of information. An 
interesting problem then is to find an optimal allocation of a limited budget to achieve the best possible clustering 
accuracy. 